Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: [Sample Paragraph - The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Connectionist Bench dataset is a binary classification problem in which we try to predict one of two possible outcomes.]

INTRODUCTION: [Sample Paragraph - The data file contains patterns obtained by bouncing sonar signals off a metal cylinder or a rock at various angles and under various conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time.]

ANALYSIS: [Sample Paragraph - The baseline performance of the machine learning algorithms achieved an average accuracy of 78.39%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.44%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 80.39%, which was below the accuracy estimate obtained from the training data.]

CONCLUSION: [Sample Paragraph - For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, the Random Forest algorithm could be considered for further modeling.]

Dataset Used: [Connectionist Bench (Sonar, Mines vs. Rocks) Data Set]

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]

One potential source of performance benchmarks: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project generally can be broken down into six major tasks:

  1. Prepare Environment
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Environment

1.a) Load libraries and packages

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)

1.b) Set up the controlling parameters and functions

# Create the random seed number for reproducible results
seedNum <- 888

# Set up the notifyStatus flag to control progress emails (set to TRUE to send status emails)
notifyStatus <- FALSE
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
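
The trainControl object above requests 10-fold cross-validation (with repeats=1, i.e., a single pass over the folds). As a rough illustration of what caret does internally when we later pass `control` and `metricTarget` to `train()`, here is a minimal base-R sketch of fold assignment and score averaging. The dataset and the trivial majority-class "model" here are synthetic and purely for illustration:

```r
set.seed(888)

# Synthetic two-class outcome vector: 100 rows
n <- 100
y <- factor(rep(c("M", "R"), length.out = n))

# Assign each row to one of 10 folds, as a resampling scheme would
k <- 10
fold_id <- sample(rep(1:k, length.out = n))

# For each fold, "train" on the other 9 folds and score the held-out fold.
# The stand-in model simply predicts the majority class of the training folds.
fold_acc <- sapply(1:k, function(i) {
  train_y <- y[fold_id != i]
  test_y  <- y[fold_id == i]
  majority <- names(which.max(table(train_y)))
  mean(test_y == majority)
})

# The cross-validated accuracy is the mean of the per-fold accuracies
cv_accuracy <- mean(fold_acc)
cv_accuracy
```

caret performs this bookkeeping for us; the averaged resampling score of this kind is what the "Accuracy" metric reports during model comparison and tuning.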
# Set up the email notification function
email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Binary Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))

1.c) Load dataset

# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]

if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
#  unzip(dest_file)
#  cat(dest_file, "unpacked!\n")
}

inputFile <- dest_file
colNames <- paste0("attr",1:60)
colNames <- c(colNames, 'targetVar')
Xy_original <- read.csv(inputFile, sep=',', header=FALSE, col.names = colNames)

# Different ways of reading and processing the input dataset. Saving these for future reference.
#X_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#Xy_train <- cbind(X_train, y_train)
# Take a peek at the dataframe after the import
head(Xy_original)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
##   targetVar
## 1         R
## 2         R
## 3         R
## 4         R
## 5         R
## 6         R
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
##         0         0         0         0         0         0         0 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
##         0         0         0         0         0         0         0 
##    attr57    attr58    attr59    attr60 targetVar 
##         0         0         0         0         0

1.d) Data Cleaning

# Not applicable for this iteration of the project
# Sample code for performing basic data cleaning tasks

# Dropping features
# Xy_original$column_name <- NULL

# Mark missing values
# invalid <- 0
# Xy_original$column_name[Xy_original$column_name==invalid] <- NA

# Impute missing values
# column_median <- median(Xy_original$column_name, na.rm = TRUE)
# Xy_original$column_name[Xy_original$column_name==0] <- column_median
# Xy_original$column_name <- with(Xy_original, impute(column_name, column_median))

# Convert columns from one data type to another
# Xy_original$column_name <- as.integer(Xy_original$column_name)
# Xy_original$column_name <- as.factor(Xy_original$column_name)
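
The commented imputation snippet above relies on Hmisc's impute(); the same idea can be sketched in base R on a toy vector (the values here are made up for illustration):

```r
# Toy column with one missing value
x <- c(0.02, 0.05, NA, 0.03, 0.04)

# Replace NA with the median of the observed values
x[is.na(x)] <- median(x, na.rm = TRUE)
x   # the NA becomes 0.035, the median of the remaining four values
```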
# Take a peek at the dataframe after the cleaning
head(Xy_original)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
##   targetVar
## 1         R
## 2         R
## 3         R
## 4         R
## 5         R
## 6         R
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
##         0         0         0         0         0         0         0 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
##         0         0         0         0         0         0         0 
##    attr57    attr58    attr59    attr60 targetVar 
##         0         0         0         0         0

1.e) Splitting Data into Training and Test Sets

# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol

# Standardize the class column to the name of targetVar if applicable
#colnames(Xy_original)[targetCol] <- "targetVar"
#Xy_original$targetVar <- relevel(Xy_original$targetVar,"pos")
# Create various sub-datasets for visualization and cleaning/transformation operations.
set.seed(seedNum)

# Use 75% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.75, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]

if (targetCol==1) {
  X_train <- Xy_train[,(targetCol+1):totCol]
  y_train <- Xy_train[,targetCol]
  y_test <- Xy_test[,targetCol]
} else {
  X_train <- Xy_train[,1:(totAttr)]
  y_train <- Xy_train[,totCol]
  y_test <- Xy_test[,totCol]
}
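
createDataPartition samples within each class, so the 75/25 split preserves the class proportions of the original dataset. A quick base-R check of that property on a synthetic label vector (the strat_partition helper is hypothetical, written only to mirror the stratified behavior):

```r
set.seed(888)

# Synthetic labels roughly matching the sonar class balance (111 M, 97 R)
y <- factor(c(rep("M", 111), rep("R", 97)))

# Base-R stratified sampling: take 75% of the indices within each class
strat_partition <- function(labels, p = 0.75) {
  unlist(lapply(levels(labels), function(lv) {
    idx <- which(labels == lv)
    sample(idx, size = floor(p * length(idx)))
  }))
}

train_idx <- strat_partition(y, p = 0.75)

# Class proportions in the training split stay close to the originals
round(prop.table(table(y)), 3)
round(prop.table(table(y[train_idx])), 3)
```

A plain random split would usually land close to these proportions too, but stratification guarantees it, which matters on a dataset this small.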

1.f) Set up the parameters for data visualization

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  4  by  15
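
The if/else arithmetic above simply rounds totAttr/dispCol up to the next whole row; base R's ceiling() on the exact quotient gives the same grid in one line:

```r
totAttr <- 60
dispCol <- 4

# ceiling() of the exact quotient reproduces the if/else arithmetic
dispRow <- ceiling(totAttr / dispCol)
dispRow   # 15 rows for 60 attributes in 4 columns
```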
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))

2. Summarize Data

To gain a better understanding of the data on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))

2.a) Descriptive statistics

2.a.i) Peek at the data itself

head(Xy_train)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## 7 0.0317 0.0956 0.1321 0.1408 0.1674 0.1710 0.0731 0.1401 0.2083 0.3513
## 8 0.0519 0.0548 0.0842 0.0319 0.1158 0.0922 0.1027 0.0613 0.1465 0.2838
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## 7 0.1786 0.0658 0.0513 0.3752 0.5419 0.5440 0.5150 0.4262 0.2024 0.4233
## 8 0.2802 0.3086 0.2657 0.3801 0.5626 0.4376 0.2617 0.1199 0.6676 0.9402
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## 7 0.7723 0.9735 0.9390 0.5559 0.5268 0.6826 0.5713 0.5429 0.2177 0.2149
## 8 0.7832 0.5352 0.6809 0.9174 0.7613 0.8220 0.8872 0.6091 0.2967 0.1103
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## 7 0.5811 0.6323 0.2965 0.1873 0.2969 0.5163 0.6153 0.4283 0.5479 0.6133
## 8 0.1318 0.0624 0.0990 0.4006 0.3666 0.1050 0.1915 0.3930 0.4288 0.2546
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## 7 0.5017 0.2377 0.1957 0.1749 0.1304 0.0597 0.1124 0.1047 0.0507 0.0159
## 8 0.1151 0.2196 0.1879 0.1437 0.2146 0.2360 0.1125 0.0254 0.0285 0.0178
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## 7 0.0195 0.0201 0.0248 0.0131 0.0070 0.0138 0.0092 0.0143 0.0036 0.0103
## 8 0.0052 0.0081 0.0120 0.0045 0.0121 0.0097 0.0085 0.0047 0.0048 0.0053
##   targetVar
## 1         R
## 2         R
## 4         R
## 6         R
## 7         R
## 8         R

2.a.ii) Dimensions of the dataset

dim(Xy_train)
## [1] 157  61

2.a.iii) Types of the attribute

sapply(Xy_train, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"

2.a.iv) Statistical summary of the attributes

summary(Xy_train)
##      attr1             attr2             attr3             attr4        
##  Min.   :0.00150   Min.   :0.00220   Min.   :0.00300   Min.   :0.00610  
##  1st Qu.:0.01350   1st Qu.:0.01720   1st Qu.:0.01900   1st Qu.:0.02450  
##  Median :0.02310   Median :0.03090   Median :0.03460   Median :0.04320  
##  Mean   :0.02978   Mean   :0.03951   Mean   :0.04569   Mean   :0.05538  
##  3rd Qu.:0.03650   3rd Qu.:0.04770   3rd Qu.:0.06040   3rd Qu.:0.06330  
##  Max.   :0.13710   Max.   :0.23390   Max.   :0.30590   Max.   :0.42640  
##      attr5             attr6            attr7            attr8       
##  Min.   :0.00670   Min.   :0.0102   Min.   :0.0033   Min.   :0.0055  
##  1st Qu.:0.03700   1st Qu.:0.0679   1st Qu.:0.0843   1st Qu.:0.0802  
##  Median :0.06170   Median :0.0924   Median :0.1098   Median :0.1130  
##  Mean   :0.07504   Mean   :0.1057   Mean   :0.1234   Mean   :0.1343  
##  3rd Qu.:0.09950   3rd Qu.:0.1354   3rd Qu.:0.1597   3rd Qu.:0.1694  
##  Max.   :0.40100   Max.   :0.3823   Max.   :0.3729   Max.   :0.4566  
##      attr9            attr10           attr11           attr12      
##  Min.   :0.0298   Min.   :0.0113   Min.   :0.0327   Min.   :0.0236  
##  1st Qu.:0.0974   1st Qu.:0.1186   1st Qu.:0.1445   1st Qu.:0.1382  
##  Median :0.1552   Median :0.1895   Median :0.2309   Median :0.2484  
##  Mean   :0.1796   Mean   :0.2084   Mean   :0.2360   Mean   :0.2490  
##  3rd Qu.:0.2361   3rd Qu.:0.2718   3rd Qu.:0.3003   3rd Qu.:0.3259  
##  Max.   :0.6828   Max.   :0.7106   Max.   :0.7342   Max.   :0.6552  
##      attr13           attr14           attr15           attr16      
##  Min.   :0.0184   Min.   :0.0273   Min.   :0.0031   Min.   :0.0162  
##  1st Qu.:0.1770   1st Qu.:0.1806   1st Qu.:0.1721   1st Qu.:0.2036  
##  Median :0.2510   Median :0.2904   Median :0.2950   Median :0.3234  
##  Mean   :0.2775   Mean   :0.3055   Mean   :0.3299   Mean   :0.3853  
##  3rd Qu.:0.3603   3rd Qu.:0.3940   3rd Qu.:0.4725   3rd Qu.:0.5392  
##  Max.   :0.7131   Max.   :0.9970   Max.   :1.0000   Max.   :0.9988  
##      attr17           attr18           attr19           attr20      
##  Min.   :0.0349   Min.   :0.0689   Min.   :0.0494   Min.   :0.0740  
##  1st Qu.:0.2088   1st Qu.:0.2349   1st Qu.:0.2989   1st Qu.:0.3658  
##  Median :0.3232   Median :0.3655   Median :0.4327   Median :0.5167  
##  Mean   :0.4221   Mean   :0.4556   Mean   :0.5064   Mean   :0.5616  
##  3rd Qu.:0.6687   3rd Qu.:0.6888   3rd Qu.:0.7309   3rd Qu.:0.8092  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr21           attr22           attr23           attr24      
##  Min.   :0.0512   Min.   :0.0689   Min.   :0.0563   Min.   :0.0239  
##  1st Qu.:0.3906   1st Qu.:0.4075   1st Qu.:0.4611   1st Qu.:0.5470  
##  Median :0.6079   Median :0.6708   Median :0.7022   Median :0.7114  
##  Mean   :0.6094   Mean   :0.6325   Mean   :0.6598   Mean   :0.6861  
##  3rd Qu.:0.8240   3rd Qu.:0.8515   3rd Qu.:0.8626   3rd Qu.:0.8675  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr25           attr26           attr27           attr28      
##  Min.   :0.0885   Min.   :0.0921   Min.   :0.0481   Min.   :0.0284  
##  1st Qu.:0.5734   1st Qu.:0.5599   1st Qu.:0.5389   1st Qu.:0.5116  
##  Median :0.7152   Median :0.7529   Median :0.7567   Median :0.7353  
##  Mean   :0.6830   Mean   :0.7074   Mean   :0.7064   Mean   :0.6831  
##  3rd Qu.:0.8675   3rd Qu.:0.8938   3rd Qu.:0.9180   3rd Qu.:0.8752  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr29           attr30           attr31           attr32      
##  Min.   :0.0144   Min.   :0.0613   Min.   :0.0482   Min.   :0.0404  
##  1st Qu.:0.4488   1st Qu.:0.3917   1st Qu.:0.3139   1st Qu.:0.2822  
##  Median :0.6790   Median :0.5986   Median :0.4770   Median :0.4219  
##  Mean   :0.6323   Mean   :0.5743   Mean   :0.4968   Mean   :0.4354  
##  3rd Qu.:0.8477   3rd Qu.:0.7575   3rd Qu.:0.6407   3rd Qu.:0.5749  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.9657   Max.   :0.9306  
##      attr33           attr34           attr35           attr36      
##  Min.   :0.0477   Min.   :0.0212   Min.   :0.0223   Min.   :0.0271  
##  1st Qu.:0.2584   1st Qu.:0.2175   1st Qu.:0.1757   1st Qu.:0.1547  
##  Median :0.3903   Median :0.3409   Median :0.3108   Median :0.3195  
##  Mean   :0.4122   Mean   :0.3938   Mean   :0.3859   Mean   :0.3821  
##  3rd Qu.:0.5409   3rd Qu.:0.5962   3rd Qu.:0.5902   3rd Qu.:0.5564  
##  Max.   :0.9708   Max.   :0.9647   Max.   :1.0000   Max.   :1.0000  
##      attr37           attr38           attr39           attr40      
##  Min.   :0.0351   Min.   :0.0383   Min.   :0.0371   Min.   :0.0117  
##  1st Qu.:0.1644   1st Qu.:0.1736   1st Qu.:0.1694   1st Qu.:0.1848  
##  Median :0.3201   Median :0.3101   Median :0.2829   Median :0.2729  
##  Mean   :0.3602   Mean   :0.3337   Mean   :0.3170   Mean   :0.3020  
##  3rd Qu.:0.5144   3rd Qu.:0.4374   3rd Qu.:0.4145   3rd Qu.:0.4158  
##  Max.   :0.9497   Max.   :1.0000   Max.   :0.9857   Max.   :0.9167  
##      attr41           attr42           attr43           attr44      
##  Min.   :0.0360   Min.   :0.0056   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1581   1st Qu.:0.1466   1st Qu.:0.1552   1st Qu.:0.1262  
##  Median :0.2558   Median :0.2331   Median :0.2211   Median :0.1749  
##  Mean   :0.2780   Mean   :0.2660   Mean   :0.2410   Mean   :0.2113  
##  3rd Qu.:0.3717   3rd Qu.:0.3712   3rd Qu.:0.3141   3rd Qu.:0.2694  
##  Max.   :0.7322   Max.   :0.8246   Max.   :0.7517   Max.   :0.5772  
##      attr45           attr46           attr47           attr48       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0944   1st Qu.:0.0690   1st Qu.:0.0621   1st Qu.:0.04200  
##  Median :0.1467   Median :0.1234   Median :0.1043   Median :0.07450  
##  Mean   :0.1962   Mean   :0.1596   Mean   :0.1190   Mean   :0.08803  
##  3rd Qu.:0.2341   3rd Qu.:0.2001   3rd Qu.:0.1492   3rd Qu.:0.11640  
##  Max.   :0.7034   Max.   :0.7292   Max.   :0.5522   Max.   :0.33390  
##      attr49            attr50            attr51           attr52       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00080  
##  1st Qu.:0.02420   1st Qu.:0.01200   1st Qu.:0.0086   1st Qu.:0.00780  
##  Median :0.04220   Median :0.01850   Median :0.0140   Median :0.01180  
##  Mean   :0.05022   Mean   :0.02051   Mean   :0.0162   Mean   :0.01372  
##  3rd Qu.:0.06810   3rd Qu.:0.02650   3rd Qu.:0.0209   3rd Qu.:0.01670  
##  Max.   :0.16080   Max.   :0.06370   Max.   :0.1004   Max.   :0.07090  
##      attr53            attr54            attr55             attr56        
##  Min.   :0.00050   Min.   :0.00110   Min.   :0.000600   Min.   :0.000600  
##  1st Qu.:0.00490   1st Qu.:0.00550   1st Qu.:0.003900   1st Qu.:0.004800  
##  Median :0.00930   Median :0.00930   Median :0.007400   Median :0.007300  
##  Mean   :0.01036   Mean   :0.01113   Mean   :0.009145   Mean   :0.008222  
##  3rd Qu.:0.01430   3rd Qu.:0.01450   3rd Qu.:0.012100   3rd Qu.:0.011100  
##  Max.   :0.03170   Max.   :0.03520   Max.   :0.037600   Max.   :0.032600  
##      attr57             attr58             attr59           attr60        
##  Min.   :0.000300   Min.   :0.000300   Min.   :0.0001   Min.   :0.000600  
##  1st Qu.:0.003800   1st Qu.:0.003700   1st Qu.:0.0040   1st Qu.:0.003300  
##  Median :0.006500   Median :0.006300   Median :0.0067   Median :0.005400  
##  Mean   :0.007862   Mean   :0.008136   Mean   :0.0082   Mean   :0.006731  
##  3rd Qu.:0.010500   3rd Qu.:0.010700   3rd Qu.:0.0109   3rd Qu.:0.008700  
##  Max.   :0.025800   Max.   :0.037700   Max.   :0.0364   Max.   :0.043900  
##  targetVar
##  M:84     
##  R:73     
##           
##           
##           
## 

2.a.v) Summarize the levels of the class attribute

cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##   freq percentage
## M   84   53.50318
## R   73   46.49682

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(X_train[,i], main=names(X_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(X_train[,i], main=names(X_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(X_train[,i]), main=names(X_train)[i])
}
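
The commented-out par(mfrow=...) calls above assume dispRow and dispCol were defined earlier in the script. A minimal sketch of how the per-attribute plots could be arranged in a grid rather than one plot per page — the 8x8 dimensions are illustrative values, not taken from the original script:

```r
# Arrange the per-attribute plots in a grid instead of one plot per page
dispRow <- 8; dispCol <- 8           # illustrative grid dimensions for 60 attributes
oldPar <- par(mfrow=c(dispRow, dispCol), mar=c(1, 1, 2, 1))
for (i in 1:totAttr) {
    plot(density(X_train[, i]), main=names(X_train)[i])
}
par(oldPar)                          # restore the previous graphics settings
```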

# Correlation matrix
correlations <- cor(X_train)
corrplot(correlations, method="circle")

if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include data transforms, feature selection, and re-balancing of the training instances.

if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))

3.a) Data Transforms

# Not applicable for this iteration of the project
# Sample code for performing SMOTE transformation on the training data

# set.seed(seedNum)
# Xy_train <- SMOTE(targetVar ~., data=Xy_train, perc.over=200, perc.under=300)
# totCol <- ncol(Xy_train)
# y_train <- Xy_train[,totCol]
# cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)

3.b) Feature Selection

# Not applicable for this iteration of the project
# Sample Code for finding collinear features (Block #1 of 2)

# Using the correlations calculated previously, we try to find attributes that are highly correlated.
# highlyCorrelated <- findCorrelation(correlations, cutoff=0.85)
# print(highlyCorrelated)
# cat('Number of attributes found to be highly correlated:',length(highlyCorrelated))
# Sample Code for finding collinear features (Block #2 of 2)

# Removing the highly correlated attributes from the training and validation dataframes
# Xy_train <- Xy_train[, -highlyCorrelated]
# Xy_test <- Xy_test[, -highlyCorrelated]
# Not applicable for this iteration of the project
# Sample code for performing Attribute Importance Ranking (Block #1 of 3)

# startTimeModule <- proc.time()
# set.seed(seedNum)
# library(gbm)
# model_fs <- train(targetVar~., data=Xy_train, method="gbm", preProcess="scale", trControl=control, verbose=F)
# rankedImportance <- varImp(model_fs, scale=FALSE)
# print(rankedImportance)
# plot(rankedImportance)
# Sample code for performing Attribute Importance Ranking (Block #2 of 3)

# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
# maxThreshold <- 0.99
# rankedAttributes <- rankedImportance$importance
# rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
# totalWeight <- sum(rankedAttributes)
# i <- 1
# accumWeight <- 0
# exit_now <- FALSE
# while ((i <= totAttr) & !exit_now) {
#   accumWeight = accumWeight + rankedAttributes[i,]
#   if ((accumWeight/totalWeight) >= maxThreshold) {
#     exit_now <- TRUE
#   } else {
#     i <- i + 1
#   }
# }
# lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
# lowAttributes <- rownames(lowImportance)
# cat('Number of attributes contributed to the importance threshold:',i,"\n")
# cat('Number of attributes found to be of low importance:',length(lowAttributes))
# Sample code for performing Attribute Importance Ranking (Block #3 of 3)

# Removing the unselected attributes from the training and validation dataframes
# Xy_train <- Xy_train[, !(names(Xy_train) %in% lowAttributes)]
# Xy_test <- Xy_test[, !(names(Xy_test) %in% lowAttributes)]
# Not applicable for this iteration of the project
# Sample code for performing Recursive Feature Elimination (Block #1 of 2)

# startTimeModule <- proc.time()
# set.seed(seedNum)
# rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10)
# rfeResults <- rfe(Xy_train[,1:totAttr], Xy_train[,totCol], sizes=c(30:55), rfeControl=rfeCTRL)
# print(rfeResults)
# rfeAttributes <- predictors(rfeResults)
# cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
# print(rfeAttributes)
# plot(rfeResults, type=c("g", "o"))
# Sample code for performing Recursive Feature Elimination (Block #2 of 2)

# Removing the unselected attributes from the training and validation dataframes
# rfeAttributes <- c(rfeAttributes,"targetVar")
# Xy_train <- Xy_train[, (names(Xy_train) %in% rfeAttributes)]
# Xy_test <- Xy_test[, (names(Xy_test) %in% rfeAttributes)]

3.c) Display the Final Datasets for Model-Building

# We finalize the training and testing datasets for the modeling activities
dim(Xy_train)
## [1] 157  61
dim(Xy_test)
## [1] 51 61
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
proc.time()-startTimeScript
##    user  system elapsed 
##  25.377   0.382  25.824

4. Model and Evaluate Algorithms

After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include spot-checking a diverse set of algorithms and comparing their estimated accuracy on the same data splits.

For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithm: Decision Trees (CART)

Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting

The random number seed is reset before each run to ensure that each algorithm is evaluated using the same data splits, which makes the results directly comparable.
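
The control and metricTarget objects referenced throughout were defined in the setup portion of the script, which is not shown in this excerpt. A plausible sketch of that setup, consistent with the "10 fold, repeated 1 times" resampling reported in the output below — the exact names and values here are assumptions:

```r
library(caret)
seedNum <- 888                       # assumed seed value; the original may differ
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"           # matches the metric used to select models below
```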

4.a) Generate models using linear algorithms

# Logistic Regression (Classification)
if (notifyStatus) email_notify(paste("Logistic Regression modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=Xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the warning pair above was repeated for each of the remaining cross-validation folds)
print(fit.glm)
## Generalized Linear Model 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6938235  0.3878228
proc.time()-startTimeModule
##    user  system elapsed 
##   0.962   0.048   0.988
if (notifyStatus) email_notify(paste("Logistic Regression modeling completed!",date()))

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.04109589  0.7022059  0.4002768
##   0.07534247  0.7205392  0.4338938
##   0.52054795  0.5913725  0.1316708
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07534247.
proc.time()-startTimeModule
##    user  system elapsed 
##   1.000   0.158   0.982
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7839216  0.5589413
proc.time()-startTimeModule
##    user  system elapsed 
##   4.403   0.320   4.354
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8544363  0.7031428
##   31    0.7913725  0.5731117
##   60    0.7788725  0.5468892
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##    user  system elapsed 
##   7.367   0.094   7.408
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy 
##   0.3  1          0.6               0.50        50      0.8168873
##   0.3  1          0.6               0.50       100      0.8168873
##   0.3  1          0.6               0.50       150      0.8035539
##   0.3  1          0.6               0.75        50      0.8094363
##   0.3  1          0.6               0.75       100      0.8230882
##   0.3  1          0.6               0.75       150      0.8160539
##   0.3  1          0.6               1.00        50      0.7800735
##   0.3  1          0.6               1.00       100      0.8047549
##   0.3  1          0.6               1.00       150      0.8176716
##   0.3  1          0.8               0.50        50      0.7836029
##   0.3  1          0.8               0.50       100      0.8231373
##   0.3  1          0.8               0.50       150      0.8098039
##   0.3  1          0.8               0.75        50      0.8039706
##   0.3  1          0.8               0.75       100      0.8168382
##   0.3  1          0.8               0.75       150      0.8160539
##   0.3  1          0.8               1.00        50      0.8117892
##   0.3  1          0.8               1.00       100      0.8235049
##   0.3  1          0.8               1.00       150      0.8176716
##   0.3  2          0.6               0.50        50      0.8239216
##   0.3  2          0.6               0.50       100      0.8176716
##   0.3  2          0.6               0.50       150      0.8243382
##   0.3  2          0.6               0.75        50      0.7985049
##   0.3  2          0.6               0.75       100      0.8293382
##   0.3  2          0.6               0.75       150      0.8230392
##   0.3  2          0.6               1.00        50      0.8110049
##   0.3  2          0.6               1.00       100      0.8110049
##   0.3  2          0.6               1.00       150      0.8364216
##   0.3  2          0.8               0.50        50      0.8231373
##   0.3  2          0.8               0.50       100      0.8548529
##   0.3  2          0.8               0.50       150      0.8481863
##   0.3  2          0.8               0.75        50      0.8176225
##   0.3  2          0.8               0.75       100      0.8360049
##   0.3  2          0.8               0.75       150      0.8168382
##   0.3  2          0.8               1.00        50      0.8168382
##   0.3  2          0.8               1.00       100      0.8301716
##   0.3  2          0.8               1.00       150      0.8426716
##   0.3  3          0.6               0.50        50      0.8239216
##   0.3  3          0.6               0.50       100      0.8239216
##   0.3  3          0.6               0.50       150      0.8301716
##   0.3  3          0.6               0.75        50      0.8540196
##   0.3  3          0.6               0.75       100      0.8481863
##   0.3  3          0.6               0.75       150      0.8606863
##   0.3  3          0.6               1.00        50      0.8376225
##   0.3  3          0.6               1.00       100      0.8305392
##   0.3  3          0.6               1.00       150      0.8489216
##   0.3  3          0.8               0.50        50      0.7988725
##   0.3  3          0.8               0.50       100      0.8047059
##   0.3  3          0.8               0.50       150      0.7980392
##   0.3  3          0.8               0.75        50      0.8165196
##   0.3  3          0.8               0.75       100      0.8298039
##   0.3  3          0.8               0.75       150      0.8235539
##   0.3  3          0.8               1.00        50      0.8047549
##   0.3  3          0.8               1.00       100      0.8114216
##   0.3  3          0.8               1.00       150      0.8055392
##   0.4  1          0.6               0.50        50      0.8102206
##   0.4  1          0.6               0.50       100      0.8172059
##   0.4  1          0.6               0.50       150      0.8418873
##   0.4  1          0.6               0.75        50      0.7926225
##   0.4  1          0.6               0.75       100      0.7984559
##   0.4  1          0.6               0.75       150      0.8109559
##   0.4  1          0.6               1.00        50      0.7984559
##   0.4  1          0.6               1.00       100      0.8176716
##   0.4  1          0.6               1.00       150      0.8239216
##   0.4  1          0.8               0.50        50      0.8106373
##   0.4  1          0.8               0.50       100      0.7789216
##   0.4  1          0.8               0.50       150      0.7847549
##   0.4  1          0.8               0.75        50      0.7804902
##   0.4  1          0.8               0.75       100      0.8176716
##   0.4  1          0.8               0.75       150      0.8110049
##   0.4  1          0.8               1.00        50      0.7918382
##   0.4  1          0.8               1.00       100      0.8047549
##   0.4  1          0.8               1.00       150      0.8051225
##   0.4  2          0.6               0.50        50      0.8297549
##   0.4  2          0.6               0.50       100      0.8544363
##   0.4  2          0.6               0.50       150      0.8481373
##   0.4  2          0.6               0.75        50      0.8152206
##   0.4  2          0.6               0.75       100      0.8414706
##   0.4  2          0.6               0.75       150      0.8348039
##   0.4  2          0.6               1.00        50      0.8301716
##   0.4  2          0.6               1.00       100      0.8485539
##   0.4  2          0.6               1.00       150      0.8543873
##   0.4  2          0.8               0.50        50      0.8239216
##   0.4  2          0.8               0.50       100      0.8426716
##   0.4  2          0.8               0.50       150      0.8356373
##   0.4  2          0.8               0.75        50      0.8101716
##   0.4  2          0.8               0.75       100      0.8097549
##   0.4  2          0.8               0.75       150      0.8164216
##   0.4  2          0.8               1.00        50      0.8427206
##   0.4  2          0.8               1.00       100      0.8427206
##   0.4  2          0.8               1.00       150      0.8489706
##   0.4  3          0.6               0.50        50      0.8355882
##   0.4  3          0.6               0.50       100      0.8547549
##   0.4  3          0.6               0.50       150      0.8668873
##   0.4  3          0.6               0.75        50      0.8611029
##   0.4  3          0.6               0.75       100      0.8611029
##   0.4  3          0.6               0.75       150      0.8423529
##   0.4  3          0.6               1.00        50      0.8560049
##   0.4  3          0.6               1.00       100      0.8626716
##   0.4  3          0.6               1.00       150      0.8626716
##   0.4  3          0.8               0.50        50      0.8223529
##   0.4  3          0.8               0.50       100      0.8348529
##   0.4  3          0.8               0.50       150      0.8286029
##   0.4  3          0.8               0.75        50      0.8360539
##   0.4  3          0.8               0.75       100      0.8235049
##   0.4  3          0.8               0.75       150      0.8235049
##   0.4  3          0.8               1.00        50      0.8114216
##   0.4  3          0.8               1.00       100      0.8176716
##   0.4  3          0.8               1.00       150      0.8239216
##   Kappa    
##   0.6315823
##   0.6329939
##   0.6048946
##   0.6147303
##   0.6418262
##   0.6280605
##   0.5571303
##   0.6057445
##   0.6314499
##   0.5654489
##   0.6446709
##   0.6159590
##   0.6035579
##   0.6298326
##   0.6289612
##   0.6198738
##   0.6444349
##   0.6314499
##   0.6464896
##   0.6332833
##   0.6465555
##   0.5931558
##   0.6571509
##   0.6449065
##   0.6185050
##   0.6185639
##   0.6696853
##   0.6417337
##   0.7073266
##   0.6940270
##   0.6297837
##   0.6693100
##   0.6298126
##   0.6299186
##   0.6570062
##   0.6840248
##   0.6411224
##   0.6427948
##   0.6536403
##   0.7050803
##   0.6927612
##   0.7185875
##   0.6685572
##   0.6556991
##   0.6947216
##   0.5943765
##   0.6089801
##   0.5954543
##   0.6279501
##   0.6545052
##   0.6420052
##   0.6032361
##   0.6165039
##   0.6051111
##   0.6178218
##   0.6321596
##   0.6836622
##   0.5812686
##   0.5934809
##   0.6201130
##   0.5948287
##   0.6322755
##   0.6447755
##   0.6149270
##   0.5512020
##   0.5653241
##   0.5561618
##   0.6315188
##   0.6194669
##   0.5814255
##   0.6048712
##   0.6059784
##   0.6589752
##   0.7081875
##   0.6949773
##   0.6246251
##   0.6763298
##   0.6649650
##   0.6562322
##   0.6942634
##   0.7051074
##   0.6406182
##   0.6801589
##   0.6686234
##   0.6142405
##   0.6110694
##   0.6271997
##   0.6805868
##   0.6818163
##   0.6943163
##   0.6673660
##   0.7068641
##   0.7307625
##   0.7178303
##   0.7178303
##   0.6803367
##   0.7076643
##   0.7203720
##   0.7203720
##   0.6411488
##   0.6685886
##   0.6560886
##   0.6686832
##   0.6440169
##   0.6440169
##   0.6148372
##   0.6290449
##   0.6427942
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 0.5.
proc.time()-startTimeModule
##    user  system elapsed 
##  39.057   3.010  22.367
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.glm, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, CART, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean  3rd Qu.   Max. NA's
## LR      0.5882353 0.6354167 0.6666667 0.6938235 0.734375 0.9375    0
## CART    0.5625000 0.6519608 0.6770833 0.7205392 0.800000 1.0000    0
## BagCART 0.5625000 0.7127451 0.7666667 0.7839216 0.853125 1.0000    0
## RF      0.7500000 0.8031250 0.8180147 0.8544363 0.918750 1.0000    0
## GBM     0.6875000 0.7901961 0.8708333 0.8668873 0.937500 1.0000    0
## 
## Kappa 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.15602837 0.2772349 0.3360752 0.3878228 0.4687500 0.8709677    0
## CART    0.06666667 0.2967317 0.3614130 0.4338938 0.6033107 1.0000000    0
## BagCART 0.09677419 0.4115087 0.5377271 0.5589413 0.7053571 1.0000000    0
## RF      0.47540984 0.6045532 0.6341783 0.7031428 0.8361486 1.0000000    0
## GBM     0.35483871 0.5776515 0.7324888 0.7307625 0.8750000 1.0000000    0
dotplot(results)

cat('The average accuracy from all models is:',
    mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.7839216

5. Improve Accuracy or Results

After we have a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve their accuracy further.

Using the two best-performing algorithms from the previous section (Random Forest and Gradient Boosting), we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,15,30,45,60))
# Note: tuneGrid=grid was not passed to train() for this run, so caret's
# default mtry grid (2, 31, 60) was used, as the output below shows.
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.final1)
## Random Forest 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8544363  0.7031428
##   31    0.7913725  0.5731117
##   60    0.7788725  0.5468892
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##    user  system elapsed 
##   7.231   0.021   7.267
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100,200,300,400,500), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=0.5)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## eXtreme Gradient Boosting 
## 
## 157 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  Accuracy   Kappa    
##   100      0.7965196  0.5896322
##   200      0.7706863  0.5401764
##   300      0.7706863  0.5401764
##   400      0.7714706  0.5408377
##   500      0.7711029  0.5384675
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 0.5.
proc.time()-startTimeModule
##    user  system elapsed 
##   3.731   0.199   2.380
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))

5.b) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##      Min.  1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.750 0.803125 0.8180147 0.8544363 0.9187500 1.0000000    0
## GBM 0.625 0.737500 0.8125000 0.7965196 0.8558824 0.9333333    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.4754098 0.6045532 0.6341783 0.7031428 0.8361486 1.0000000    0
## GBM 0.2380952 0.4850848 0.6250000 0.5896322 0.7050290 0.8672566    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.

if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 23  6
##          R  4 18
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.6688, 0.9018)
##     No Information Rate : 0.5294          
##     P-Value [Acc > NIR] : 4.341e-05       
##                                           
##                   Kappa : 0.6047          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7931          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.5294          
##          Detection Rate : 0.4510          
##    Detection Prevalence : 0.5686          
##       Balanced Accuracy : 0.8009          
##                                           
##        'Positive' Class : M               
## 
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
cat('Area under the curve is:', auc)
## Area under the curve is: 0.8009259
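
Because predict() above returns hard class labels, the ROC curve collapses to a single operating point and the AUC equals the balanced accuracy. A sketch of computing the curve from class probabilities instead, reusing the ROCR prediction()/performance() calls already loaded for this script — the column selection and label encoding are assumptions:

```r
# Use class probabilities rather than hard labels for a full ROC curve
probs <- predict(fit.final1, newdata=Xy_test, type="prob")
pred_prob <- prediction(probs[, "M"], y_test == "M")
perf_prob <- performance(pred_prob, measure="tpr", x.measure="fpr")
plot(perf_prob, colorize=TRUE)
auc_prob <- performance(pred_prob, measure="auc")@y.values[[1]]
cat('Area under the curve (probability-based) is:', auc_prob)
```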
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 19  4
##          R  8 20
##                                           
##                Accuracy : 0.7647          
##                  95% CI : (0.6251, 0.8721)
##     No Information Rate : 0.5294          
##     P-Value [Acc > NIR] : 0.0004667       
##                                           
##                   Kappa : 0.5321          
##                                           
##  Mcnemar's Test P-Value : 0.3864762       
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8261          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.5294          
##          Detection Rate : 0.3725          
##    Detection Prevalence : 0.4510          
##       Balanced Accuracy : 0.7685          
##                                           
##        'Positive' Class : M               
## 
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
cat('Area under the curve is:', auc)
## Area under the curve is: 0.7685185

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
set.seed(seedNum)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Combining datasets to form a complete dataset that will be used to train the final model
Xy_complete <- rbind(Xy_train, Xy_test)

finalModel <- randomForest(targetVar~., Xy_complete, mtry=2, na.action=na.omit)
summary(finalModel)
##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted        208   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes            416   matrix numeric  
## oob.times        208   -none- numeric  
## classes            2   -none- character
## importance        60   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                208   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
proc.time()-startTimeModule
##    user  system elapsed 
##   0.291   0.001   0.293
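
To inspect which frequency bands drive the final model, the importance measures that randomForest computes by default can be plotted. A short sketch using the randomForest package already attached above — the n.var cutoff is an arbitrary choice for readability:

```r
# Plot the top attributes by mean decrease in Gini impurity
varImpPlot(finalModel, n.var=15, main="Top 15 attributes by importance")
```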

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
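
When the saveRDS() line is uncommented, the saved model can later be restored and used for scoring without retraining. A minimal sketch, kept commented out in the same style as the line above; the file path simply mirrors it:

```r
# Restore the finalized model and score new observations with it
# finalModel <- readRDS("./finalModel_BinaryClass.rds")
# newPredictions <- predict(finalModel, newdata=Xy_test)
# head(newPredictions)
```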
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
proc.time()-startTimeScript
##    user  system elapsed 
##  90.704   4.291  73.156